Efficient Work Stealing for Portability of Nested Parallelism and Composability of Multithreaded Program
Author
Abstract
We present a performance evaluation of a parallel-for loop implemented with work stealing. The work-stealing parallel-for transforms the parallel loop into a binary tree by divide-and-conquer. Iterations are distributed among the leaf procedures of the binary tree, and parallel execution proceeds by stealing subtrees from the bottom of the tree. Work stealing and divide-and-conquer are used to address the portability problem in nested parallelism and composability. With these techniques, a fine-grained parallel-for can be implemented without incurring large work overhead. Low work overhead is important because the number of processors available could be fewer than expected. The low overhead and fine granularity of the work-stealing scheduler allow highly parallel processor cores to scale performance. In addition, the approach used in this work makes efficient nested parallelism possible. Because of this low overhead, we show that work stealing and divide-and-conquer deliver good scalability in a nested parallel sparse LU factorization.
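The divide-and-conquer decomposition described in the abstract can be illustrated with a minimal, single-threaded sketch. It is not the paper's implementation: the function name `parallel_for`, the `grain` parameter, and the recorded `leaves` list are assumptions made for illustration, and no actual stealing takes place. The sketch only shows how a loop range becomes a binary tree whose leaf procedures hold the iterations.

```python
def parallel_for(lo, hi, body, grain=2, depth=0, leaves=None):
    """Recursively halve [lo, hi) until a range holds at most `grain` iterations.

    Hypothetical helper: returns the list of leaves as (lo, hi, depth) tuples
    so the shape of the binary tree can be inspected.
    """
    if leaves is None:
        leaves = []
    if hi - lo <= grain:
        # Leaf procedure: run its iterations serially.
        for i in range(lo, hi):
            body(i)
        leaves.append((lo, hi, depth))
        return leaves
    mid = lo + (hi - lo) // 2
    # In a work-stealing runtime, one of these two subtrees would be made
    # available for an idle worker to steal while the current worker runs
    # the other. Here both are simply executed in order.
    parallel_for(lo, mid, body, grain, depth + 1, leaves)
    parallel_for(mid, hi, body, grain, depth + 1, leaves)
    return leaves


visited = []
leaves = parallel_for(0, 8, visited.append, grain=2)
```

With 8 iterations and a grain of 2, the tree has four leaves at depth 2, and every iteration runs exactly once. A smaller grain gives finer-grained tasks at the cost of more tree nodes, which is why low per-node overhead matters.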
Similar resources
Executing multithreaded programs efficiently
This thesis presents the theory, design, and implementation of Cilk (pronounced “silk”) and Cilk-NOW. Cilk is a C-based language and portable runtime system for programming and executing multithreaded parallel programs. Cilk-NOW is an implementation of the Cilk runtime system that transparently manages resources for parallel programs running on a network of workstations. Cilk is built around a ...
History-Based Adaptive Work Distribution
Exploiting parallelism of increasingly heterogeneous parallel architectures is challenging due to the complexity of parallelism management. To achieve high performance portability whilst preserving high productivity, high-level approaches to parallel programming delegate parallelism management, such as partitioning and work distribution, to the compiler and the run-time system. Random work stea...
Auto-tuned nested parallelism: A way to reduce the execution time of scientific software in NUMA systems
Scientific and engineering problems are solved with large parallel systems. In some cases those systems are NUMA: a large number of cores share a hierarchically organized memory. The kernel of the computation for those problems is BLAS or similar. Efficient use of kernels gives a faster solution of a large range of scientific problems. Auto-tuned nested parallelism: a way to reduce the execution time of sci...
Portable high-performance programs
This dissertation discusses how to write computer programs that attain both high performance and portability, despite the fact that current computer systems have different degrees of parallelism, deep memory hierarchies, and diverse processor architectures. To cope with parallelism portably in high-performance programs, we present the Cilk multithreaded programming system. In the Cilk-5 system,...
Truly Nested Data-Parallelism: Compiling SaC for the MicroGrid Architecture
Data-parallel programming facilitates elegant specification of concurrency. However, the composability of data-parallel operations has so far been constrained by the requirement to have only flat data-parallel operations at runtime. In this paper, we present early results on our work to exploit hardware support for nested concurrency to directly map nested data-parallel operations in high-level s...
Publication date: 2013